Abstract 


The C-bound, introduced by Lacasse et al., gives a tight upper bound on the risk 
of a binary majority vote classifier. In this work, we present a first step towards 
extending this work to more complex outputs, by providing generalizations of the 
C-bound to the multiclass and multi-label settings. 

1 Introduction 

In binary classification, many state-of-the-art algorithms output prediction functions that can be seen 
as a majority vote of “simple” classifiers. Ensemble methods such as Bagging, Boosting [3] and 
Random Forests are well-known examples of learning algorithms that output majority votes. 
Majority votes are also central in the Bayesian approach (see Gelman et al. for an introductory 
text); in this setting, the majority vote is generally called the Bayes classifier. It is also interesting 
to point out that classifiers produced by kernel methods, such as the Support Vector Machine (SVM), 
can be viewed as majority votes. Indeed, to classify an example $x$, the SVM classifier computes 
$\mathrm{sign}\big(\sum_{i=1}^{|S|} \alpha_i y_i k(x_i, x)\big)$. Hence, as for standard binary majority votes, if the total weight of the terms 
$\alpha_i y_i k(x_i, x)$ that vote positive is larger than the total weight for the negative choice, the classifier 
will output a $+1$ label (and a $-1$ label in the opposite case). 
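
The vote interpretation of the SVM decision can be illustrated with a minimal Python sketch. The Gaussian kernel, the support vectors and the coefficients below are made-up values chosen for illustration (they do not come from a trained model), and the bias term is omitted, as in the expression above.

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian kernel between two points given as tuples (always >= 0)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_vote(x, support, gamma=1.0):
    """Split the SVM score into positive and negative vote weight.

    support is a list of (alpha_i, y_i, x_i) triples with alpha_i >= 0.
    """
    pos = neg = 0.0
    for alpha, y, xi in support:
        weight = alpha * rbf_kernel(xi, x, gamma)  # nonnegative vote weight
        if y > 0:
            pos += weight
        else:
            neg += weight
    return pos, neg, (1 if pos > neg else -1)

# Hypothetical support vectors: (alpha_i, y_i, x_i)
support = [(0.5, +1, (0.0, 0.0)), (1.0, -1, (2.0, 2.0)), (0.7, +1, (0.0, 1.0))]
pos, neg, label = svm_vote((0.1, 0.2), support)
print(pos, neg, label)
```

The query point lies close to the two positive support vectors, so the positive weight dominates and the predicted label is $+1$.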

Most bounds on majority votes take into account the margin of the majority vote on an example 
$(x, y)$, that is, the difference between the total vote weight given to the winning class 
and the weight given to the alternative class. As an example, PAC-Bayesian bounds control the risk 
of majority vote classifiers by relating them to a stochastic classifier, called the Gibbs classifier, whose risk 
is, up to a linear transformation, equivalent to the first statistical moment of the margin when $(x, y)$ 
is drawn i.i.d. from a distribution $D$. Unfortunately, in most ensemble methods, the voters are weak 
and no majority vote can obtain high margins. Lacasse et al. proposed a tighter bound on 
the risk of the majority vote that takes into account both the first and the second moments of the 
margin: the C-bound. This sheds new light on the behavior of majority votes: what matters is not only how 
good the voters are, but also how correlated their votes are. Notably, this has inspired a 
new learning algorithm named MinCq, whose performance is state-of-the-art. In this work, we 
generalize the C-bound to multiclass and multi-label weighted majority votes as a first step towards 
the goal of designing learning algorithms for more complex outputs. 

This paper is organized as follows. Section 2 recalls the C-bound in binary classification. We 
generalize it to the multiclass and multi-label settings in Sections 3 and 4. We conclude in Section 5. 
2 The C-bound for Binary Classification 


In this section, we recall the C-bound in the binary classification setting. 

Let $X \subseteq \mathbb{R}^d$ be the input space of dimension $d$, and let $Y = \{-1, +1\}$ be the output space. The 
learning sample $S = \{(x_i, y_i)\}_{i=1}^{m}$ consists of $m$ examples drawn i.i.d. from a fixed but 
unknown distribution $D$ over $X \times Y$. Let $\mathcal{H}$ be a set of real-valued voters from $X$ to $Y$. Given a prior 
distribution $\pi$ on $\mathcal{H}$ and given $S$, the goal of the PAC-Bayesian approach is to find the posterior 
distribution $\rho$ on $\mathcal{H}$ which minimizes the true risk of the $\rho$-weighted majority vote $B_\rho(\cdot)$ given by 

$$R_D(B_\rho) = \mathbb{E}_{(x,y)\sim D}\, I\big(B_\rho(x) \neq y\big), \quad \text{where } B_\rho(x) = \mathrm{sign}\Big[\mathbb{E}_{h\sim\rho}\, h(x)\Big],$$

and where $I(a) = 1$ if predicate $a$ is true and $0$ otherwise. 

It is well-known that minimizing $R_D(B_\rho)$ is NP-hard. To get around this problem, one solution is to 
make use of the C-bound, which is a tight bound on $R_D(B_\rho)$. This bound is based on the notion 
of margin of $B_\rho(\cdot)$ defined as follows. 

Definition 1 (the margin). Let $M_\rho^D$ be the random variable that, given an example $(x, y)$ drawn 
according to $D$, outputs the margin of $B_\rho(\cdot)$ on that example, defined by $M_\rho(x, y) = y\, \mathbb{E}_{h\sim\rho}\, h(x)$. 


We then consider the first and second statistical moments of the random variable $M_\rho^D$, respectively 
given by $\mu_1(M_\rho^D) = \mathbb{E}_{(x,y)\sim D}\, M_\rho(x,y)$ and $\mu_2(M_\rho^D) = \mathbb{E}_{(x,y)\sim D}\, \big(M_\rho(x,y)\big)^2$. 


According to the definition of the margin, $B_\rho(\cdot)$ correctly classifies an example $(x, y)$ when its 
margin is strictly positive, i.e. $R_D(B_\rho) = \Pr_{(x,y)\sim D}\big(M_\rho(x,y) \leq 0\big)$. This equality makes it 
possible to prove the following theorem. 

Theorem 1 (The C-bound of Laviolette et al. [7]). For every distribution $\rho$ on a set of real-valued 
functions $\mathcal{H}$, and for every distribution $D$ on $X \times Y$, if $\mu_1(M_\rho^D) > 0$, then we have 

$$R_D(B_\rho) \leq 1 - \frac{\big(\mu_1(M_\rho^D)\big)^2}{\mu_2(M_\rho^D)}.$$


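Proof. The Cantelli-Chebyshev inequality states that for any random variable $Z$ and any $a > 0$, we 
have $\Pr\big(Z \leq \mathbb{E}[Z] - a\big) \leq \frac{\mathrm{Var}\,Z}{\mathrm{Var}\,Z + a^2}$. We obtain the result by applying this inequality with 
$Z = M_\rho(x, y)$ and $a = \mu_1(M_\rho^D)$, and by using the definition of the variance, 
$\mathrm{Var}\,Z = \mu_2(M_\rho^D) - \big(\mu_1(M_\rho^D)\big)^2$, so that the right-hand side equals $1 - \big(\mu_1(M_\rho^D)\big)^2 / \mu_2(M_\rho^D)$. □ 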
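
Since the theorem holds for every distribution, it also holds for the empirical distribution of a finite sample, which makes it easy to check numerically. The following Python sketch does so on a toy problem; the threshold voters, the weights `rho` and the eight examples are hypothetical and chosen only so that the numbers can be verified by hand.

```python
def margin(rho, voters, x, y):
    """M_rho(x, y) = y * E_{h~rho} h(x) for voters with outputs in {-1, +1}."""
    return y * sum(w * h(x) for w, h in zip(rho, voters))

def c_bound(margins):
    """Return 1 - mu_1^2 / mu_2 for a list of margins (requires mu_1 > 0)."""
    m = len(margins)
    mu1 = sum(margins) / m
    mu2 = sum(v * v for v in margins) / m
    if mu1 <= 0:
        raise ValueError("the C-bound requires a positive first moment")
    return 1.0 - mu1 * mu1 / mu2

# Hypothetical 1-D sample (x, y); the last two examples are noisy.
sample = [(-2.0, -1), (-1.0, -1), (-0.5, -1), (0.5, 1),
          (1.0, 1), (2.0, 1), (0.2, -1), (-0.3, 1)]
# Three hypothetical threshold voters and their posterior weights.
voters = [
    lambda x: 1 if x > 0 else -1,
    lambda x: 1 if x > -0.75 else -1,
    lambda x: 1 if x > 0.75 else -1,
]
rho = [0.5, 0.25, 0.25]

margins = [margin(rho, voters, x, y) for x, y in sample]
risk = sum(1 for v in margins if v <= 0) / len(margins)  # Pr(M_rho <= 0)
bound = c_bound(margins)
print(risk, bound)
```

On this sample the empirical risk is 0.25 and the empirical C-bound is 0.6, so the inequality holds with some slack.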

Note that the minimization of the empirical counterpart of the C-bound is a natural solution for 
learning a distribution $\rho$ that leads to a $\rho$-weighted majority vote $B_\rho(\cdot)$ with low error. This strategy 
is justified by an elegant PAC-Bayesian generalization bound over the C-bound, and has led 
to a simple learning algorithm called MinCq. 

In the following, we generalize this important theoretical result in the PAC-Bayesian literature to the 
multiclass setting. 


3 Generalizations of the C-bound for Multiclass Classification 


In this section, we consider the multiclass classification setting where the input space is still $X \subseteq \mathbb{R}^d$, 
but the output space is $Y = \{1, \dots, Q\}$, with a finite number of classes $Q \geq 2$. Let $\mathcal{H}$ be a set of 
multiclass voters from $X$ to $Y$. We recall that given a prior distribution $\pi$ over $\mathcal{H}$ and given a sample 
$S$ drawn i.i.d. from $D$, the PAC-Bayesian approach looks for the distribution $\rho$ which minimizes the true 
risk of the majority vote $B_\rho(\cdot)$. In the multiclass classification setting, $B_\rho(\cdot)$ is defined by 

$$B_\rho(x) = \operatorname*{argmax}_{c \in Y}\ \mathbb{E}_{h\sim\rho}\, I\big(h(x) = c\big). \tag{1}$$
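
Equation (1) amounts to accumulating, for each class, the total weight of the voters that predict it. A minimal Python sketch follows; the three voters' predictions and the weights `rho` are hypothetical, and breaking ties in favor of the first class in `classes` is an assumption of this sketch, not something the definition specifies.

```python
def vote_weights(rho, predictions, classes):
    """Return {c: E_{h~rho} I(h(x) = c)} from each voter's predicted class."""
    weights = {c: 0.0 for c in classes}
    for w, pred in zip(rho, predictions):
        weights[pred] += w
    return weights

def majority_vote(rho, predictions, classes):
    """B_rho(x): the class with the largest total weight (ties: first class)."""
    weights = vote_weights(rho, predictions, classes)
    return max(classes, key=lambda c: weights[c])

classes = [1, 2, 3]
rho = [0.4, 0.35, 0.25]
predictions = [1, 2, 2]  # h_1(x) = 1, h_2(x) = 2, h_3(x) = 2

weights = vote_weights(rho, predictions, classes)
winner = majority_vote(rho, predictions, classes)
print(weights, winner)
```

Here the heaviest single voter predicts class 1, but the two other voters together outweigh it, so the majority vote outputs class 2.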


As in binary classification, the risk $R_D(B_\rho)$ of a $\rho$-weighted majority vote can be related to the 
notion of margin realized on an example $(x, y)$. However, in multiclass classification, such a notion 
can be expressed in a variety of manners. In the next section, we present three versions of multiclass 
margins that are equivalent in binary classification. 
3.1 Margins in Multiclass Classification 


We first make use of the multiclass margin proposed by Breiman [4] for random forests, which 
can be seen as the usual notion of margin. Note that when $Y = \{-1, +1\}$, we recover the usual 
notion of binary margin of Definition 1. 

Definition 2 (the multiclass margin). Let $D$ be a distribution over $X \times Y$, let $\mathcal{H}$ be a set of multiclass 
voters. Given a distribution $\rho$ on $\mathcal{H}$, the margin of the majority vote $B_\rho(\cdot)$ realized on $(x, y) \sim D$ is 

$$M_\rho(x,y) = \mathbb{E}_{h\sim\rho}\, I\big(h(x) = y\big) - \max_{c \in Y,\, c \neq y} \Big(\mathbb{E}_{h\sim\rho}\, I\big(h(x) = c\big)\Big).$$

As in the binary classification framework presented in Section 2, the majority vote $B_\rho(\cdot)$ correctly 
classifies an example if its $\rho$-margin is strictly positive, i.e., $R_D(B_\rho) = \Pr_{(x,y)\sim D}\big(M_\rho(x,y) \leq 0\big)$. 

We then consider the relaxation proposed by Breiman that is based on a notion of strength of the 
majority vote with regard to a class $c$. 

Definition 3 (the strength). Let $\mathcal{H}$ be a set of multiclass voters from $X$ to $Y$ and let $\rho$ be a 
distribution on $\mathcal{H}$. Let $S_{\rho,c}^D$ be the random variable that, given an example $(x, y)$ drawn according 
to a distribution $D$ over $X \times Y$, outputs the strength of the majority vote $B_\rho(\cdot)$ on that example 
according to a class $c \in Y$, defined by $S_{\rho,c}(x, y) = \mathbb{E}_{h\sim\rho}\, I\big(h(x) = y\big) - \mathbb{E}_{h\sim\rho}\, I\big(h(x) = c\big)$. 

From this definition, one can show that 

$$R_D(B_\rho) = \Pr_{(x,y)\sim D}\big(M_\rho(x, y) \leq 0\big) \leq \sum_{c=1}^{Q} \Pr_{(x,y)\sim D}\big(S_{\rho,c}(x, y) \leq 0\big) - 1, \tag{2}$$

where we have equality in the binary classification setting. Lastly, we consider a relaxation of 
the notion of margin, that we call the $\omega$-margin. 
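
The multiclass margin, the strength and Inequality (2) can be checked on a small sample. In the Python sketch below, the six examples, the three voters' predictions and the weights `RHO` are hypothetical; the subtraction of 1 removes the term $c = y$, for which the strength is always zero.

```python
CLASSES = [1, 2, 3]
RHO = [0.5, 0.25, 0.25]
# Each entry: (predictions of the three voters on x, true class y).
SAMPLE = [((1, 1, 1), 1), ((2, 2, 1), 2), ((3, 3, 3), 3),
          ((1, 2, 3), 1), ((2, 1, 1), 1), ((3, 1, 2), 2)]

def vote_weights(rho, preds, classes):
    """{c: E_{h~rho} I(h(x) = c)} for one example."""
    return {c: sum(w for w, p in zip(rho, preds) if p == c) for c in classes}

def margin(weights, y):
    """Multiclass margin: weight of y minus the largest competing weight."""
    return weights[y] - max(w for c, w in weights.items() if c != y)

def strength(weights, y, c):
    """Strength with respect to class c: weight of y minus weight of c."""
    return weights[y] - weights[c]

m = len(SAMPLE)
votes = [(vote_weights(RHO, preds, CLASSES), y) for preds, y in SAMPLE]

risk = sum(1 for w, y in votes if margin(w, y) <= 0) / m
relaxed = sum(
    sum(1 for w, y in votes if strength(w, y, c) <= 0) / m
    for c in CLASSES
) - 1
print(risk, relaxed)
```

On this sample the risk is 1/3 while the right-hand side of Inequality (2) is 1/2; the gap comes from the last example, where two competing classes both have non-positive strength and are therefore counted twice.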

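Definition 4 (the $\omega$-margin). Let $\mathcal{H}$ be a set of multiclass voters from $X$ to $Y$, let $\rho$ be a distribution 
on $\mathcal{H}$ and let $\omega \geq 1$. Let $M_{\rho,\omega}^D$ be the random variable that, given an example $(x, y) \sim D$ over 
$X \times Y$, outputs the $\omega$-margin of the majority vote $B_\rho(\cdot)$ on that example, defined by 

$$M_{\rho,\omega}(x,y) = \mathbb{E}_{h\sim\rho}\, I\big(h(x) = y\big) - \frac{1}{\omega}. \tag{3}$$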


This notion of margin can be seen as the difference between the weight given by the majority vote to 
the correct class $y$ and a certain threshold $1/\omega$. In the case of binary classification, 
the sign of the $\omega$-margin with $\omega = 2$ is the same as the sign of the binary margin. This observation 
comes from the fact that $\mathbb{E}_{h\sim\rho}\, I\big(h(x) = y\big)$ is the proportion of voters that vote $y$. In the binary case, 
this proportion is $\leq 1/2$ when the majority vote makes a mistake, and $> 1/2$ otherwise. The following 
theorem relates the risk of $B_\rho(\cdot)$ and the $\omega$-margin associated with $\rho$. 

Theorem 2. Let $Q \geq 2$ be the number of classes. For every distribution $D$ over $X \times Y$ and for 
every distribution $\rho$ over a set of multiclass voters $\mathcal{H}$, we have 

$$\Pr_{(x,y)\sim D}\big(M_{\rho,Q}(x,y) \leq 0\big) \leq R_D(B_\rho) \leq \Pr_{(x,y)\sim D}\big(M_{\rho,2}(x,y) \leq 0\big). \tag{4}$$


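Proof. First, let us prove the left-hand side inequality. We have 

$$\begin{aligned}
R_D(B_\rho) &= \Pr_{(x,y)\sim D}\big(M_\rho(x, y) \leq 0\big) \\
&= \Pr_{(x,y)\sim D}\Big(\mathbb{E}_{h\sim\rho}\, I\big(h(x) = y\big) \leq \max_{c \in Y,\, c \neq y} \mathbb{E}_{h\sim\rho}\, I\big(h(x) = c\big)\Big) \\
&\geq \Pr_{(x,y)\sim D}\Big(\mathbb{E}_{h\sim\rho}\, I\big(h(x) = y\big) \leq \frac{1}{Q-1} \sum_{c \in Y,\, c \neq y} \mathbb{E}_{h\sim\rho}\, I\big(h(x) = c\big)\Big) \\
&= \Pr_{(x,y)\sim D}\Big(\mathbb{E}_{h\sim\rho}\, I\big(h(x) = y\big) \leq \frac{1}{Q-1}\Big(1 - \mathbb{E}_{h\sim\rho}\, I\big(h(x) = y\big)\Big)\Big) \\
&= \Pr_{(x,y)\sim D}\big(M_{\rho,Q}(x,y) \leq 0\big).
\end{aligned}$$

The inequality holds because the maximum over the $Q - 1$ classes $c \neq y$ is at least their average. 
The right-hand side inequality is easily verified by observing that the majority vote necessarily 
makes a correct prediction if the weight given to the correct class $y$ is higher than $1/2$. □ 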
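
The sandwich of Theorem 2 can be observed numerically. The Python sketch below uses a hypothetical sample with $Q = 3$ classes (voter predictions and weights `RHO` are made up) and compares the risk with the probabilities of a non-positive $\omega$-margin for $\omega = Q$ and $\omega = 2$.

```python
CLASSES = [1, 2, 3]
RHO = [0.5, 0.25, 0.25]
# Each entry: (predictions of the three voters on x, true class y).
SAMPLE = [((1, 1, 1), 1), ((2, 2, 1), 2), ((3, 3, 3), 3),
          ((1, 2, 3), 1), ((2, 1, 1), 1), ((3, 1, 2), 2)]

def vote_weights(rho, preds, classes):
    """{c: E_{h~rho} I(h(x) = c)} for one example."""
    return {c: sum(w for w, p in zip(rho, preds) if p == c) for c in classes}

def margin(weights, y):
    """Multiclass margin of Definition 2."""
    return weights[y] - max(w for c, w in weights.items() if c != y)

def omega_margin(weights, y, omega):
    """omega-margin: weight of the correct class minus the threshold 1/omega."""
    return weights[y] - 1.0 / omega

m = len(SAMPLE)
votes = [(vote_weights(RHO, preds, CLASSES), y) for preds, y in SAMPLE]

lower = sum(1 for w, y in votes if omega_margin(w, y, len(CLASSES)) <= 0) / m
risk = sum(1 for w, y in votes if margin(w, y) <= 0) / m
upper = sum(1 for w, y in votes if omega_margin(w, y, 2) <= 0) / m
print(lower, risk, upper)
```

The three quantities are 1/6, 1/3 and 1/2 on this sample: the threshold $1/Q$ misses the tie of the fifth example, while the threshold $1/2$ also flags the fourth example, which the majority vote classifies correctly with only half of the weight.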


All the above-mentioned notions of margin are equivalent in the binary classification 
setting. However, they differ in the multiclass setting. The multiclass margin of Definition 2 is 
associated with the true decision function in multiclass classification, and is calculated considering 
all other classes. The strength of Definition 3 depends on the true class $y$ of $x$ and corresponds 
to a combination of binary margins (one class versus another class) for $c \neq y$. The $\omega$-margin of 
Definition 4 also depends on the true class $y$ of $x$, but does not consider the other classes. This 
measure is easier to manipulate, but implies a larger indecision region (see Theorem 2). 
3.2 Generalizations of the C-bound in the Multiclass Setting 


The following bound is based on the multiclass margin of Definition 2. 

Theorem 3 (the multiclass C-bound). For every distribution $\rho$ on a set of multiclass voters $\mathcal{H}$, and 
for every distribution $D$ on $X \times Y$, such that $\mu_1(M_\rho^D) > 0$, we have 

$$R_D(B_\rho) \leq 1 - \frac{\big(\mu_1(M_\rho^D)\big)^2}{\mu_2(M_\rho^D)}.$$


Proof. The proof is the same as that of the binary C-bound (see Theorem 1), considering 
the multiclass majority vote of Equation (1) and the multiclass margin of Definition 2. □ 


This bound offers an accurate relation between the risk of the majority vote and the margin. However, 
the max term in the definition of the multiclass margin makes the derivation of an algorithm to 
minimize this bound much harder than in binary classification. 


Using the strength of Definition 3 and following the proof process of the 
C-bound, we obtain the following relation. 

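Theorem 4. For every distribution $\rho$ on a set of multiclass voters $\mathcal{H}$, and for every distribution $D$ 
over $X \times Y$, such that $\forall c \in Y$, $\mu_1(S_{\rho,c}^D) > 0$, we have 

$$R_D(B_\rho) \leq \sum_{c=1}^{Q} \Pr_{(x,y)\sim D}\big(S_{\rho,c}(x, y) \leq 0\big) - 1 \leq (Q - 1) - \sum_{c=1}^{Q} \frac{\big(\mu_1(S_{\rho,c}^D)\big)^2}{\mu_2(S_{\rho,c}^D)}.$$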


Proof. The result is obtained by using Inequality (2) in the proof of the C-bound. □ 
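
To compare the margin-based and strength-based bounds, the Python sketch below evaluates both on a hypothetical sample with three classes. The data and the weights `RHO` are made up for illustration, and `moment_ratio` is a helper name introduced here; both bounds are computed on the empirical distribution of the sample.

```python
CLASSES = [1, 2, 3]
RHO = [0.5, 0.25, 0.25]
# Each entry: (predictions of the three voters on x, true class y).
SAMPLE = [((1, 1, 1), 1), ((2, 2, 1), 2), ((3, 3, 3), 3),
          ((1, 2, 3), 1), ((2, 1, 1), 1), ((3, 1, 2), 2)]

def vote_weights(rho, preds, classes):
    """{c: E_{h~rho} I(h(x) = c)} for one example."""
    return {c: sum(w for w, p in zip(rho, preds) if p == c) for c in classes}

def margin(weights, y):
    return weights[y] - max(w for c, w in weights.items() if c != y)

def strength(weights, y, c):
    return weights[y] - weights[c]

def moment_ratio(values):
    """mu_1^2 / mu_2 of a list of values; requires mu_1 > 0."""
    n = len(values)
    mu1 = sum(values) / n
    mu2 = sum(v * v for v in values) / n
    if mu1 <= 0:
        raise ValueError("first moment must be positive")
    return mu1 * mu1 / mu2

votes = [(vote_weights(RHO, preds, CLASSES), y) for preds, y in SAMPLE]
margins = [margin(w, y) for w, y in votes]

risk = sum(1 for v in margins if v <= 0) / len(margins)
# Bound based on the multiclass margin.
margin_bound = 1.0 - moment_ratio(margins)
# Bound based on the strengths: (Q - 1) minus one ratio per class.
strength_bound = (len(CLASSES) - 1) - sum(
    moment_ratio([strength(w, y, c) for w, y in votes]) for c in CLASSES
)
print(risk, margin_bound, strength_bound)
```

On this sample the risk is 1/3, the margin-based bound is about 0.56 and the strength-based bound about 0.86, which illustrates the price paid for removing the max term.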


This result can be seen as a sum of C-bounds, one for every class. A practical drawback of this bound 
for constructing a minimization algorithm is that we have to minimize a sum of ratios. Finally, 
the C-bound obtained by using the $\omega$-margin is given by the following theorem. 

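Theorem 5. For every distribution $\rho$ on a set of multiclass voters $\mathcal{H}$, for every $\omega \geq 1$, and for every 
distribution $D$ on $X \times Y$, if $\mu_1(M_{\rho,\omega}^D) > 0$, we have 

$$\mathbb{E}_{(x,y)\sim D}\, I\big(M_{\rho,\omega}(x,y) \leq 0\big) \leq 1 - \frac{\big(\mu_1(M_{\rho,\omega}^D)\big)^2}{\mu_2(M_{\rho,\omega}^D)}.$$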


Proof. The result is obtained with the same proof process as for the C-bound, by replacing 
the random variable $M_\rho^D$ by $M_{\rho,\omega}^D$. □ 

The $\omega$-margin being linear in $\rho$, we are now able to build a bound minimization algorithm as in Laviolette 
et al. [7] for the multiclass classification setting. 
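
The linearity of the $\omega$-margin means that its first moment is a linear form in $\rho$ and its second moment a quadratic form, built from the voters' accuracies and pairwise joint accuracies. The Python sketch below checks this on a hypothetical matrix of voter correctness; the function names and the data are illustrative and are not taken from MinCq.

```python
def direct_moments(rho, correct, omega):
    """Moments of M_{rho,omega} computed example by example.

    correct[i][j] is 1 if voter j predicts the true class of example i.
    """
    m = len(correct)
    margins = [sum(w * c for w, c in zip(rho, row)) - 1.0 / omega
               for row in correct]
    return sum(margins) / m, sum(v * v for v in margins) / m

def moments_from_statistics(rho, correct, omega):
    """Same moments, written as a linear and a quadratic form in rho."""
    m, n = len(correct), len(rho)
    acc = [sum(row[j] for row in correct) / m for j in range(n)]
    joint = [[sum(row[j] * row[k] for row in correct) / m for k in range(n)]
             for j in range(n)]
    lin = sum(rho[j] * acc[j] for j in range(n))
    quad = sum(rho[j] * rho[k] * joint[j][k]
               for j in range(n) for k in range(n))
    return lin - 1.0 / omega, quad - 2.0 * lin / omega + 1.0 / omega ** 2

# Hypothetical correctness of three voters on six examples.
CORRECT = [(1, 1, 1), (1, 1, 0), (1, 1, 1), (1, 0, 0), (0, 1, 1), (0, 0, 1)]
RHO = [0.5, 0.25, 0.25]

mu1, mu2 = direct_moments(RHO, CORRECT, omega=2)
mu1_s, mu2_s = moments_from_statistics(RHO, CORRECT, omega=2)
bound = 1.0 - mu1 * mu1 / mu2  # valid because mu1 > 0 on this sample
frac_nonpositive = sum(
    1 for row in CORRECT
    if sum(w * c for w, c in zip(RHO, row)) - 0.5 <= 0
) / len(CORRECT)
print(mu1, mu2, bound, frac_nonpositive)
```

Because `acc` and `joint` do not depend on $\rho$, they can be computed once from the sample, after which the bound is a ratio of a squared linear form and a quadratic form in $\rho$.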


4 Extending the $\omega$-margin to the Multi-label Setting 

In this section, we extend the $\omega$-margin with $\omega = 2$ to the more general multi-label classification 
setting. Doing so, we will be able to upper bound the risk of the multi-label majority vote classifier. 
We consider the multi-label classification setting where the input space is still $X \subseteq \mathbb{R}^d$ and the space 
of possible labels is $Y = \{1, \dots, Q\}$ with a finite number of classes $Q \geq 2$, but we consider the 
output space $\overline{Y} = \{0, 1\}^Q$ that contains vectors $y$ of length $Q$ where the $i$-th element is 1 if label $i$ 
is among the labels associated with the example $x$, and 0 otherwise. We consider a set $\mathcal{H}$ of 
multi-label voters $h : X \to \overline{Y}$. As usual in structured output prediction, given a distribution $\rho$ over $\mathcal{H}$, the 
multi-label majority vote classifier $B_\rho$ chooses the label $c \in \overline{Y}$ that has the lowest squared Euclidean 
distance to the $\rho$-weighted cumulative confidence, 

$$B_\rho(x) = \operatorname*{argmin}_{c \in \overline{Y}} \Big\| c - \mathbb{E}_{h\sim\rho}\, h(x) \Big\|^2 = \operatorname*{argmax}_{c \in \overline{Y}}\ c \cdot \Big(\mathbb{E}_{h\sim\rho}\, h(x) - \tfrac{1}{2}\mathbf{1}\Big),$$


where $\mathbf{1}$ is a vector of length $Q$ containing ones. The multi-label margin is given by Definition 5. 
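
The equivalence between the distance-based and the score-based forms of the multi-label vote can be checked by brute force for small $Q$. In the Python sketch below, the three voters' outputs and the weights `rho` are hypothetical; the score-based form decomposes over coordinates, so it reduces to thresholding each coordinate of the mean vote at $1/2$.

```python
from itertools import product

def mean_vote(rho, outputs):
    """E_{h~rho} h(x) for voter outputs given as 0/1 tuples of length Q."""
    q = len(outputs[0])
    return [sum(w * out[i] for w, out in zip(rho, outputs)) for i in range(q)]

def vote_by_distance(v):
    """argmin over c in {0,1}^Q of ||c - v||^2, by exhaustive search."""
    return min(product((0, 1), repeat=len(v)),
               key=lambda c: sum((ci - vi) ** 2 for ci, vi in zip(c, v)))

def vote_by_score(v):
    """argmax over c of c . (v - 1/2 * 1), solved coordinate by coordinate."""
    return tuple(1 if vi > 0.5 else 0 for vi in v)

rho = [0.5, 0.25, 0.25]
outputs = [(1, 0, 1), (1, 1, 0), (0, 0, 1)]  # h_1(x), h_2(x), h_3(x)

v = mean_vote(rho, outputs)
print(v, vote_by_distance(v), vote_by_score(v))
```

The exhaustive search costs $2^Q$ evaluations, whereas the thresholding form is linear in $Q$; both return the label vector $(1, 0, 1)$ here.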
Figure 1: Graphical representation of label $y$, hyperplane $P_y$ and vector $\mathbb{E}_{h\sim\rho}\, h(x)$. 


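Definition 5 (the multi-label margin). Let $D$ be a distribution over $X \times \overline{Y}$, let $\mathcal{H}$ be a set of 
multi-label voters. Given a distribution $\rho$ on $\mathcal{H}$, the margin of the majority vote $B_\rho(\cdot)$ on $(x, y)$ is 

$$M_\rho(x,y) = y \cdot \Big(\mathbb{E}_{h\sim\rho}\, h(x) - \tfrac{1}{2}\mathbf{1}\Big) - \max_{c \in \overline{Y},\, c \neq y}\ c \cdot \Big(\mathbb{E}_{h\sim\rho}\, h(x) - \tfrac{1}{2}\mathbf{1}\Big).$$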


As we did in the multiclass setting with the margin of Definition 2, we can upper bound the risk of 
the multi-label majority vote classifier by developing a C-bound using the margin of Definition 5. 
However, as this margin also depends on a max term, the derivation of a learning algorithm 
minimizing the resulting C-bound remains hard. To overcome this, we generalize $M_{\rho,2}$, the $\omega$-margin 
with $\omega = 2$ of Definition 4, to the multi-label setting, as follows. 

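Definition 6. Let $D$ be a distribution over $X \times \overline{Y}$, let $\mathcal{H}$ be a set of multi-label voters. Given a 
distribution $\rho$ on $\mathcal{H}$, the 2-margin of the majority vote $B_\rho(\cdot)$ on $(x, y)$ is 

$$M_{\rho,2}(x,y) = \Big(\mathbb{E}_{h\sim\rho}\, h(x) - y^{i \to 1/2}\Big) \cdot \Big(y - \tfrac{1}{2}\mathbf{1}\Big) = y \cdot \Big(\mathbb{E}_{h\sim\rho}\, h(x) - \tfrac{1}{2}\mathbf{1}\Big) - \mathbb{E}_{h\sim\rho}\, h(x) \cdot \tfrac{1}{2}\mathbf{1} + \tfrac{1}{4},$$

where $i \in \{1, \dots, Q\}$ and $y^{i \to 1/2}$ is obtained from $y$ by replacing its $i$-th coordinate by $1/2$. 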


The second equality of the definition is obtained by straightforward calculation. Now, let $P_y$ be 
the unique hyperplane containing all the points of the form $y^{i \to 1/2}$ for $i = 1, \dots, Q$. Since this 
hyperplane has normal $\big(y - \tfrac{1}{2}\mathbf{1}\big)$, it follows from basic linear algebra that if $M_{\rho,2}(x,y) > 0$, then 
the vectors $\mathbb{E}_{h\sim\rho}\, h(x)$ and $y$ lie on the same side of $P_y$. It is also easy to see that in this case, 
we have $B_\rho(x) = y$. Figure 1 shows an example in the case where $Q = 2$. Thus, we have that 
$R_D(B_\rho) \leq \Pr_{(x,y)\sim D}\big(M_{\rho,2}(x, y) \leq 0\big)$, and following the same arguments as in Theorem 5, one 
can derive the following multi-label C-bound. 

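Theorem 6. For every distribution $\rho$ on a set of multi-label voters $\mathcal{H}$ and for every distribution $D$ 
on $X \times \overline{Y}$, if $\mu_1(M_{\rho,2}^D) > 0$, we have 

$$R_D(B_\rho) \leq \mathbb{E}_{(x,y)\sim D}\, I\big(M_{\rho,2}(x,y) \leq 0\big) \leq 1 - \frac{\big(\mu_1(M_{\rho,2}^D)\big)^2}{\mu_2(M_{\rho,2}^D)}.$$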
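
The chain of inequalities for the multi-label 2-margin can be illustrated numerically. The Python sketch below assumes the expanded form $M_{\rho,2}(x,y) = y \cdot (v - \tfrac{1}{2}\mathbf{1}) - \tfrac{1}{2}\, v \cdot \mathbf{1} + \tfrac{1}{4}$ with $v = \mathbb{E}_{h\sim\rho}\, h(x)$, and works directly with hypothetical mean votes $v$ for $Q = 2$ rather than with explicit voters. Under that assumed form, the margin equals $\tfrac{1}{4} - \tfrac{1}{2}\|v - y\|_1$ whenever $v \in [0,1]^Q$, which the sketch also checks.

```python
def two_margin(v, y):
    """Assumed expanded form: y.(v - 1/2 1) - 1/2 v.1 + 1/4."""
    return (sum(yi * (vi - 0.5) for yi, vi in zip(y, v))
            - 0.5 * sum(v) + 0.25)

def predict(v):
    """Multi-label majority vote: threshold each coordinate at 1/2."""
    return tuple(1 if vi > 0.5 else 0 for vi in v)

# Hypothetical (mean vote, true label vector) pairs with Q = 2.
SAMPLE = [((1.0, 0.0), (1, 0)), ((0.75, 0.0), (1, 0)),
          ((0.75, 0.25), (1, 0)), ((0.25, 0.75), (1, 0)),
          ((1.0, 1.0), (1, 1)), ((0.875, 1.0), (1, 1)),
          ((0.0, 0.125), (0, 0)), ((1.0, 0.125), (1, 0))]

m = len(SAMPLE)
margins = [two_margin(v, y) for v, y in SAMPLE]
l1_form = [0.25 - 0.5 * sum(abs(vi - yi) for vi, yi in zip(v, y))
           for v, y in SAMPLE]

risk = sum(1 for v, y in SAMPLE if predict(v) != y) / m
relaxed = sum(1 for mg in margins if mg <= 0) / m
mu1 = sum(margins) / m
mu2 = sum(mg * mg for mg in margins) / m
bound = 1.0 - mu1 * mu1 / mu2  # valid because mu1 > 0 on this sample
print(risk, relaxed, bound)
```

On this sample the risk is 0.125, the probability of a non-positive 2-margin is 0.25 (the third example is correctly classified but has zero margin), and the resulting bound is about 0.88.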


5 Conclusion and Outlooks 

In this paper, we extend an important theoretical result in the PAC-Bayesian literature to the 
multiclass and multi-label settings. Concretely, we prove three multiclass versions and one multi-label 
version of the C-bound, a bound over the risk of the majority vote, based on generalizations of 
the notion of margin for multiclass and multi-label classification. These results open the way to 
extending the theory to more complex outputs and developing new algorithms for multiclass and 
multi-label classification with PAC-Bayesian generalization guarantees. 